Predicting the Severity of Disease Progression in COVID-19 at the Individual and Population Level: A Mathematical Model

The impact of COVID-19 disease on health and economy has been global, and the magnitude of devastation is unparalleled in modern history. Any potential course of action to manage this complex disease requires the systematic and efficient analysis of data that can delineate the underlying pathogenesis. We have developed a mathematical model of disease progression to predict the clinical outcome, utilizing a set of causal factors known to contribute to COVID-19 pathology such as age, comorbidities, and certain viral and immunological parameters. Viral load and selected indicators of a dysfunctional immune response, such as cytokines IL-6 and IFNα which contribute to the cytokine storm and fever, parameters of inflammation D-Dimer and Ferritin, aberrations in lymphocyte number, lymphopenia, and neutralizing antibodies were included for the analysis. The model provides a framework to unravel the multi-factorial complexities of the immune response manifested in SARS-CoV-2 infected individuals. Further, this model can be valuable to predict clinical outcome at an individual level, and to develop strategies for allocating appropriate resources to manage severe cases at a population level.


INTRODUCTION
The COVID-19 pandemic caused by infection with SARS-CoV-2 was officially announced in March 2020 by the CDC and WHO [1,2]. As of this publication, more than 170 million infections and over 3.5 million deaths have been reported worldwide. Majority of the subjects have asymptomatic infections. The rate of fatality is disproportionately high in the elderly and patients with comorbidities such as diabetes, cardiac disease, and kidney disease [3,4]. The consequences of the pandemic are fraught with potential loss of lives, social and economic distress, and the uncertainty of disease progression because of variable individual pathogenesis.
A unique and dysregulated immune response has been shown to be a hallmark of COVID-19 [5][6][7][8][9]. The figure 1 schematically depicts the cascade of events that contribute to the progression of disease. Mathematical models have been utilized by several investigators to understand the mechanisms of disease pathogenesis, immune pathways involved and course of viral infections [10,11]. In this article, we have proposed a predictive model that utilizes the levels of clinical and laboratory parameters to determine the severity of clinical outcomes ranging from asymptomatic to mild, moderate, severe, and critical disease states. The proposed model can be useful to predict clinical outcome at the individual-level and develop efficient and effective treatment strategies to manage public health challenges at the population-level. The questions the model attempts to answer are: at an individual level, what is the probability of an individual infected with SARS-CoV-2, given the clinical signs and laboratory values on various days, likely to progress to severe disease; at a population level, what are the prioritized clinical and laboratory parameters that are most likely to contribute to progression to severe disease. We have used a multiple regression based model to predict severity of the outcome of COVID-19. To evaluate the combinatorics that are not observed in the sample, we have applied resampling methods based on Monte Carlo simulation (Figure 3a and 3b).

Development of a simulated dataset
A simulated data set of 45 individual subjects was created with 15 subjects assumed to be asymptomatic, 15 with moderate disease, and 15 with severe COVID-19 [12,13]. The simulated values for the viral and immune parameters were generated using data from clinical reports published in the last year for each of the selected parameters. This table provides the ranges and the related references for the values for all parameters and figure shows the box-and-whisker plots for the distribution of the values for each parameter (Table  1) (Figure 2).

Data modeling
We have applied multiple linear regression approach to the simulated data set for COVID-19 subjects generated and analyzed to understand the impact of each of the parameters on the outcome of disease severity. We chose a multiple regression model since both, the outcomes and predictors, were numeric. We used regression models to establish a predictive transfer function and evaluated significance of results. In this model, the relationship between independent variables (x 1 , x 2 …x n ) with dependent variable (y) can be visualized by the equation, y=f (x 1 , x 2 …x n ). This is the transfer function that is derived through analysis. The validity of the model was established using 'Goodness of Fit' and ANOVA. The statistical significance of the model was tested by evaluating residuals and F Ratio in one-way ANOVA, based on the criteria of p<0.05 and goodness-of-fit with adjusted R-squared>90%. The assumption for this analysis was that each of the parameters were independent. However, in cases where factual patient datasets will be subjected to this type of analysis, there may be multi-co-linearity within the parameters that should be rationalized using dimensionality reduction methods [14,15].
Since the model may not exhibit multiple combination of parameters in the limited dataset of 45 subjects, we have used resampling methods using Monte Carlo simulation to achieve a better density of combinations. The simulation was applied for resampling of the transfer function with 2000 runs, where a convergence was achieved after multiple runs. The simulation was performed in order to understand the impact of possible parameter combinations on clinical outcomes. Monte Carlo simulation uses random variates from selected range of values to model the impact of progression of events leading to outcomes.

Data analysis using training and testing data sets
Model building involved partitioning the data set into 'training' and 'testing' sets. We apportioned 70% of the data to train the model and used the remaining 30% to test the model, using random selection algorithms. Following development of the model, we analyzed a set of test data to compare predicted versus observed results to validate the model.
The linear coefficients of the prediction equation determined the weights of each parameter to predict the clinical outcome.

Estimation of the coefficients of input parameters
The modeling approach was based on utilizing clinical and laboratory parameters to fit the regression models. Since direct comparison of regression coefficients was not necessary, and interactions in factors were not considered on account of assumption of independence of factors, we chose to leave the factor-data in the original scales.

Rationale for the parameters included in the analysis
The input parameters selected for this model, which requires cause (clinical and laboratory parameters) and effect (clinical outcome) relationships, were based on the data reported in recent scientific publications. The Figure 1 shows the schematic representation of the stage of disease progression and parameters associated with the increasing severity of diseases. The following parameters were chosen: neurological, cardiac and lung and kidney disease have been reported to contribute towards severity of COVID-19 [16,17]. The simulated data for comorbidity was generated using an arbitrary range of 1 to 4, where 1 represented a healthy individual and 4 represented an individual with a severe comorbidity.
Age: A range of 18 to 100 years was utilized for generating the stimulated data set.
The assumption used in generating the data was that disease progression was directly proportional to age [17]. Reports of certain rare pathogenic conditions in children, example Kawasaki disease, have not been considered in the current model [18,19]. Reports indicate that majority of children infected with SARS-CoV-2 are asymptomatic [19].
Viral load: SARS-CoV-2 infects individuals through the nasopharyngeal pathway. This infection is the cause of all subsequent effects. Viral load is measured by reversetranscriptase quantitative PCR (RT-qPCR), which detects viral RNA from nasopharyngeal swabs [20]. The test relies on multiple cycles of RNA amplification to produce detectable amount of RNA in the mixed nucleic acid sample, reflected in the Cycle-time (Ct) value, which is defined as the number of cycles necessary to detect the virus. A Ct value of less than 20 is considered a high viral load while a Ct value of 35 and higher indicates a lower level or near absence of viral infection [20]. Viral load in patients is dependent on various factors, including number of ACE2 and TMPRSS2 receptors, comorbidities, cytokines, number of viral particles at infection, and the overall immune health status of the patients [21][22][23][24][25][26]. Viral loads have been demonstrated to have a direct correlation with severity of disease and mortality in COVID-19 [27,28].
Cytokine Storm: High viral loads evoke defensive mechanisms that can induce inflammation leading to a dysregulated innate immune response that could result in a cytokine storm characterized by fever-inducing levels of cytokines such as IL6, IFNα, IL1β and CXCL-10 [27,[29][30][31][32][33]. CXCL-10, interestingly was also found to be indicative of severe outcomes in patients affected by the SARS CoV1 outbreak in 2002 [34]. Cytokine storm has been implicated in contributing to pulmonary immunopathology, leading to severe clinical disease and mortality. In this model, we have included levels of IFNα and IL6 obtained from the published data.
Systemic Inflammation: Laboratory based parameters indicating inflammation in the serum, such as D-Dimer and Ferritin, have been shown to lead to a reduction in blood oxygen saturation levels, reflecting inadequate oxygenation in the lungs [35,36].
Lymphopenia: Viral infection can lead to marked lymphopenia that can affect both CD4+ and CD8+ T cells [3,28,36]. Lymphopenia likely due to sequestration and cell death reflected by significantly reduced CD4 and CD8 T cells in peripheral blood, has been reported in moderate and severe COVID-19 patients. In addition, antigen specific CD8 Cytotoxic T lymphocyte (CTL) responses have been detected approximately a week following viral infection, and the magnitude of the response was observed to have protective or damaging effects [37].
Neutralizing antibodies: Neutralizing antibodies bind to specific surface receptors on infectious agents such as viruses and toxins, reducing or eliminating their ability to exert harmful effects on cells. SARS-CoV-2 infected individuals generate a robust and longlasting neutralizing antibody response targeting spike protein epitopes, and plasma from convalescent COVID-19 patients has been used for treatment of severe disease with some success [38,39]. It has recently been reported that neutralizing antibodies to SARS-CoV-2 can predict severity and survival, with higher titers being associated with severe disease in some instances [40].

RESULTS
We evaluated multiple approaches to develop mathematical models using parameters that can predict the progression of disease. Candidate parameters were selected from mechanistic understanding of the process of pathogenesis of COVID-19 to evaluate their possible impact on the clinical outcome. Regression methods utilize data to build predictive models. Hypotheses are examined and confirmed with pre-determined statistical confidence and inferential power. These models incorporate all the experimental variability in the data set. Since the models contained numeric factors and numeric ordinal outcomes, we utilized methods of Multiple Linear Regression [41]. In this approach, we used the simulated data set from COVID-19 affected subjects, organized, and analyzed it to understand the variability of each of the parameters.

Regression modeling approach
The data set was parsed into training and testing partitions using methods of randomization. The validity of the model was based on goodness-of-fit of R-squared>90% and ANOVA, [p value<0.05] and a consequent F Ratio. These statistical results confirmed acceptable degree of predictability of the model (Tables 2a and 2b).
Following this multiple-regression analysis, we conducted 2,000 bootstrap samplings using the predicted coefficients and random variates from chosen intervals of parameters. The assumption for this analysis was that each of the parameters were independent variables. The coefficients of each parameter were determined by using multiple regression analyses, which is the multiplier to the parameter value in a linear regression equation. The inclusion of all the variables in analysis ensures their contribution to the model [41]. However, analysts applying this model in the future may, at their judgment, evaluate statistical significance of regression coefficients. Parameters that are not significant maybe excluded using step wise regression. In our analysis, results based on training dataset predictors matched with those from the test dataset confirming an acceptable degree of predictability of the model. We invite the readers of this article to contact us to analyze the predictive potential of the model using their clinical data.

Monte Carlo simulation
To determine the factors that contribute to the clinical outcome at the population level, Monte Carlo simulation was performed on a sample set of laboratory and clinical parameters covering the full range, from asymptomatic to severe disease, of outcomes in Figures 3,4 [12,13]. The histogram and cumulative data show the distribution of asymptomatic to severe outcomes. The tornado chart shows the sensitivity of parameter to the outcome in the selected range (Figures 3a and 3b).

The predictive model
Based on the correlation coefficient of the parameters and the outcome from the training data set, we developed a model using the prediction equation . The table 3 shows the process of predicting the outcome. When the numerical values of the individual parameters for each patient are entered into the columns, the model predicts the outcome. Among the 7 'Test Subjects', six of the subjects were 'predicted' an outcome that was same as the 'observed'. The validation of the model will require data from patients and subjects from clinical trials. The goal of this exercise was to develop a model that can be used to predict the outcome in a large number of patients ( Figure 4) (Table 3).

DISCUSSION
We have evaluated multiple regression analysis for mathematically modeling the course of COVID-19 and predict clinical outcome. The premise of this model is that quantitatively measured clinical and laboratory parameters involved in the pathogenesis of disease progression can be mathematically mapped to a multiple-regression model. COVID-19 is initiated by infection of the subject with SARS-CoV-2 with subsequent replication in the epithelial cells of the lung. The factors that contribute to the viral load include number of cells that express the ACE2 and other receptors, and inflammatory cytokines. Comorbidities contribute towards a more serious disease progression. Virus infection of antigen presenting cells, such as dendritic cells, macrophages, and other cell types including endothelial cells, result in activation of biochemical signals, which lead to secretion of a battery of cytokines that include IL1β and IL-6. The viral infections as well as inflammatory cytokines cause fever and an increase in serum inflammatory factors such as D-Dimer and Ferritin. Induction of an inflammatory response contributes to reduction of the total numbers of lymphocytes from circulation. The inflammation results in a loss of lung function (reduction in blood-oxygen levels), cardiac function (blood pressure fluctuation) and can culminate in multi-organ failure.
Subjects with a normal immune response can generally mount an adequate innate and adaptive response to the virus. These individuals clear the virus by generating adaptive T cell responses and neutralizing antibodies. Subjects with comorbid conditions can have compromised immune function which could result in dysfunctional activation of inflammatory responses, leading to worse clinical outcomes.
Selection of the parameters that were included in the model building process was influenced by their perceived significance from current research reports. This list of factors is by no means complete and it is expected that in due course a more comprehensive list will emerge. This report provides a basis for creating a tool, independent of the number and type of parameters that could find utility in predicting the disease outcome using those parameters.

Viral Load:
Association of viral load and progression of diseases has been reported for several viral infections [42][43][44]. Viral load in COVID-19 is measured by RT-qPCR of SARS-CoV-2 using primers for the spike gene [43]. The correlation of high viral load with severity of disease progression has been extensively demonstrated. The systemic dissemination of the virus has been associated with expression of the ACE2 receptor on endothelial cells [21]. Comorbid conditions could enhance the expression of receptors and enable distribution of virus, thereby enhancing the viral load, which can result in progression of disease.

IFNα:
The critical role of Type I interferons in innate and adaptive immunity, leading to both protective and pathogenic responses, has been reported in the case of several viral and bacterial infections [45]. SARS-CoV-2 infection has been shown to result in a diverse range of effects on Type I immune responses. Most patients elicit a strong IFNα response along with a battery of inflammatory cytokines, some of which progress to a cytokine storm [46,47]. Specific blocking of the type I mediated signal transduction by various proteins of SARS-CoV-2 has been demonstrated [48]. A remarkably high proportion of male subjects experiencing severe or critical COVID-19 disease expressed an inability to produce sufficient levels of IFNα due to various types of errors in the IFN genes. Curiously, majority of the male subjects possessed circulating IFNα autoantibodies that had the ability to neutralize the endogenously produced cytokine, thereby effectively reducing the available IFNα. The discovery of these two mechanisms for lowering IFNα levels underscores its relevance in controlling the progression of disease in individuals infected with the SARS-CoV-2 [49].

D-Dimer:
D-Dimer is routinely measured in clinical situations because its levels correlate with serious underlying conditions including venous thromboembolism, cancer and sepsis [48]. In the case of COVID-19 patients, introduction of the virus brings about infection-induced inflammatory alterations leading to coagulopathy. Lungs being the target of SARS-CoV-2, acute injury to the lung as well as multi organ failure have been caused by the virus-induced cascade of the inflammatory pathway. In an early study on 41 COVID-19 patients, those with severe disease had higher levels of D-Dimer along with high levels of IL-8, TNFα and IL-2R [31]. Male patients were found to have higher levels of IL-6, IL-2R, Ferritin and other markers of inflammation compared to female. High levels of IL-6 showed a statistically significant correlation with severe disease in a retrospective study as well [27]. One can hypothesize that such patients would likely benefit from anticoagulation therapy.

Ferritin:
A high level of Ferritin, measure of stored iron, was found to be associated with severe disease in COVID-19 patients and was linked to high fatality rates in a 72 patient prospective study [33,50,51]. In another study on 39 patients, those with mild COVID-19 symptoms had lower levels of Ferritin while those with moderate or severe symptoms expressed higher levels of Ferritin [50].

Lymphopenia:
Loss of lymphocytes after viral infections has been associated with severe disease. The mechanisms involved in lymphodepletion include cell death, cytokine storm and/or redistribution of lymphocyte populations [3,33,37]. In this model, we have utilized lymphopenia as a measure of severity of disease progression. Loss of immune function could result in several potential mechanisms of pathogenesis including autoimmunity, hyperactivation, increased susceptibility to infections and organ dysfunction.

Neutralizing antibodies:
Induction of neutralizing antibodies directed to the receptor-binding domain of the spike protein is critical for restricting entry of the virus into the cells and has been one of the central tenets of a protective immune response. In this model, we have used a range of IgG titers to spike protein for the simulated data set [52]. However, the role of neutralizing antibodies induced in a large proportion of subjects following natural infection is still being studied [53]. Some subjects do not elicit strong antibody responses. Sub-optimal levels of antibodies may catalyze generation of virus mutants [54]. Neutralizing antibodies to the virus have generally not correlated with reduced severity of disease in the primary infection. In addition, it will be interesting to decipher the role of pre-existing antibodies reported recently in the modulation of disease and its impact on vaccination regimens. Thus, the mechanisms involved in the induction of antibodies, the repertoire and diversity of responses, and effects on protection versus progression, remains to be clearly established [55][56][57].
The predictive model can have multiple applications, such as forecasting the percentage of the population that will progress to severe disease in each geography, enabling logistics planning for hospital beds, health care providers and personal-protective safety equipment. Analysis of the coefficient of correlations of parameters with outcome of disease may provide clues to a better understanding of the mechanism of action of disease pathogenesis. The model can predict the probability of disease progression at an individual level, based on parameter data, and can be used to understand the effect and impact of therapeutic interventions. The predictive model can be utilized to analyze large amounts of data to develop algorithms for personalized treatment regimens.

CONCLUSION
In conclusion, we have developed a probabilistic model that can be utilized to predict progression of disease following infection with SARS-CoV-2. This model was developed using simulated data based on published levels of COVID-19 related clinical and laboratory parameters and provides an approach to predicting the outcome of disease. Validation of the model will require existing data and the clinical outcomes of patients. Prediction of disease progression can be highly valuable at an individual as well as population level.  Box-and-whisker plots of the simulated data: the figures show the visual representation of the summary, which includes median (q2/50th percentile); first quartile (q1/25th percentile); third quartile (q3/75th percentile); interquartile range in whiskers, maximum and outliers.  Histogram from Monte Carlo simulation: 2,000 bootstrap samplings were generated using the predicted coefficients from the linear regression analysis, from the intervals of parameters; the minimum and maximum values for each of the parameters were set to the levels in Figure 4;the distribution of the severity of outcome is in this frequency histogram; the values on the x axis denote the disease severity, and y axis denotes frequency of the population in each level of clinical outcome.   The ranges (Maximum and Minimum) of each of the parameters on which the Monte Carlo simulation was performed.    The prediction of outcome based on observed and predicted values. The values of the parameters for each of the seven subjects are entered in columns, upon running of the model. The predicted values are calculated in numerical values in a range of 1-7, with 1 being asymptomatic, and 7 most severe.